{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "4b9be8bf",
   "metadata": {},
   "source": [
    "\n",
    "# Cardiovascular Study Biostatistics and Predictive Analytics\n",
    "\n",
    "This notebook demonstrates the analysis of the Cleveland Heart Disease dataset, covering hypothesis testing, linear regression, and logistic regression models.\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c0e8f74f",
   "metadata": {},
   "source": [
    "\n",
    "## Loading and Examining the Cleveland Dataset\n",
    "\n",
    "The Cleveland dataset is used to predict the presence of heart disease based on several patient characteristics. First, we load the data and examine its structure.\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "243820d9",
   "metadata": {},
   "source": [
    "\n",
    "# Dataset Information and Download Links\n",
    "\n",
    "The analysis in this notebook utilizes the **Cleveland Heart Disease dataset**, which is publicly available.\n",
    "\n",
    "### Accessing the Dataset\n",
    "\n",
    "You can download the dataset from the following sources:\n",
    "\n",
    "1. **UCI Machine Learning Repository:**\n",
    "   - [Heart Disease Dataset - UCI](https://archive.ics.uci.edu/ml/datasets/Heart%2BDisease)\n",
    "   - Look for the file named `processed.cleveland.data`.\n",
    "\n",
    "2. **Kaggle:**\n",
    "   - [Heart Disease Cleveland UCI Dataset - Kaggle](https://www.kaggle.com/datasets/cherngs/heart-disease-cleveland-uci)\n",
    "\n",
    "### Dataset Attributes\n",
    "\n",
    "- The dataset contains the following features:\n",
    "  - **age**: Age of the patient.\n",
    "  - **sex**: Gender of the patient (0 = Female, 1 = Male).\n",
    "  - **cp**: Chest pain type (categorical).\n",
    "  - **trestbps**: Resting blood pressure (mmHg).\n",
    "  - **chol**: Serum cholesterol (mg/dL).\n",
    "  - **fbs**: Fasting blood sugar > 120 mg/dL (1 = true, 0 = false).\n",
    "  - **restecg**: Resting electrocardiographic results (categorical).\n",
    "  - **thalach**: Maximum heart rate achieved.\n",
    "  - **exang**: Exercise-induced angina (1 = yes, 0 = no).\n",
    "  - **oldpeak**: ST depression induced by exercise relative to rest.\n",
    "  - **slope**: Slope of the peak exercise ST segment.\n",
    "  - **ca**: Number of major vessels (0–3) colored by fluoroscopy.\n",
    "  - **thal**: Thalassemia (categorical).\n",
    "  - **cad**: Presence of heart disease (0 = no presence, 1–4 = varying severity).\n",
    "\n",
    "### Usage Notes\n",
    "\n",
    "- Handle missing values appropriately (`?` values in the dataset represent missing data).\n",
    "- Recode the target variable (`cad`) into binary: 0 = No CAD, 1 = CAD presence.\n",
    "- Refer to the [dataset documentation](https://archive.ics.uci.edu/ml/datasets/Heart%2BDisease) for more details.\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "00a23c2c",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "# Set the column names based on dataset documentation\n",
    "column_names = [\n",
    "    \"age\", \"sex\", \"cp\", \"trestbps\", \"chol\", \"fbs\", \"restecg\",\n",
    "    \"thalach\", \"exang\", \"oldpeak\", \"slope\", \"ca\", \"thal\", \"cad\"\n",
    "]\n",
    "\n",
    "# Load the dataset (replace with the actual file path)\n",
    "dataset = pd.read_csv(\n",
    "    r'C:\\Path\\to\\processed.cleveland.data',\n",
    "    names=column_names, header=None, na_values=[\"?\"]\n",
    ")\n",
    "\n",
    "# Display the first few rows\n",
    "print(dataset.head())\n",
    "\n",
    "# Dichotomize the 'cad' variable\n",
    "dataset['cad'] = np.where(dataset['cad'] > 0, 1, 0)\n",
    "\n",
    "# Group the dataset by 'cad' column\n",
    "grouped_data = dataset.groupby(\"cad\")\n",
    "\n",
    "# Calculate and display descriptive statistics\n",
    "statistics = grouped_data.describe()\n",
    "print(statistics.T)\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ec170cbd",
   "metadata": {},
   "source": [
    "\n",
    "## Hypothesis Testing: Age Differences Between CAD and Non-CAD Subjects\n",
    "\n",
    "Using a t-test, we check if the mean age differs significantly between subjects with and without CAD.\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d2c49b1f",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "from scipy.stats import ttest_ind\n",
    "\n",
    "# Separate data for CAD and non-CAD groups\n",
    "datacad = dataset[dataset['cad'] == 1]\n",
    "datacontrol = dataset[dataset['cad'] == 0]\n",
    "\n",
    "# Perform t-test for age\n",
    "t_statistic, p_value = ttest_ind(datacad['age'], datacontrol['age'])\n",
    "print(f\"t-statistic: {t_statistic}\")\n",
    "print(f\"p-value: {p_value}\")\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "280acb5b",
   "metadata": {},
   "source": [
    "\n",
    "## Linear Regression: Stress Test Heart Rate vs. ST Depression\n",
    "\n",
    "We use multivariate linear regression to model the relationship between ST depression, age, and maximum heart rate achieved during a stress test.\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fb3f13e4",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "import statsmodels.formula.api as smf\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Multivariate Linear Regression\n",
    "linmod = smf.ols(\n",
    "    formula='thalach ~ oldpeak + age', data=datacad\n",
    ").fit()\n",
    "\n",
    "# Print regression summary\n",
    "print(linmod.summary())\n",
    "\n",
    "# Regression plot\n",
    "sns.regplot(x='oldpeak', y='thalach', data=datacad, ci=95)\n",
    "plt.title('Linear Regression: Stress Test Max BPM vs ST Depression')\n",
    "plt.xlabel('ST Depression Level (mm)')\n",
    "plt.ylabel('Stress Test Max BPM')\n",
    "plt.show()\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c63a0906",
   "metadata": {},
   "source": [
    "\n",
    "## Logistic Regression: CAD Prediction Using ST Depression, Age, and Cholesterol\n",
    "\n",
    "We create a logistic regression model to predict CAD status using ST depression, age, and serum cholesterol levels as predictors.\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "80c2b3b8",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "import statsmodels.api as sm\n",
    "\n",
    "# Logistic regression model\n",
    "logit_model = sm.Logit.from_formula(\n",
    "    formula='cad ~ oldpeak + age + chol', data=dataset\n",
    ").fit()\n",
    "\n",
    "# Print logistic regression summary\n",
    "print(logit_model.summary())\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0453a1ff",
   "metadata": {},
   "source": [
    "\n",
    "### Logistic Regression with Exercise-Induced Angina as an Additional Predictor\n",
    "\n",
    "Adding exercise-induced angina (exang) to the model improves its predictive capability.\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e4316972",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "# Logistic regression model with exang\n",
    "logit_model_exang = sm.Logit.from_formula(\n",
    "    formula='cad ~ oldpeak + age + exang + chol', data=dataset\n",
    ").fit()\n",
    "\n",
    "# Print logistic regression summary\n",
    "print(logit_model_exang.summary())\n",
    "\n",
    "# Calculate odds ratio\n",
    "odds_ratios = np.exp(logit_model_exang.params)\n",
    "conf_intervals = np.exp(logit_model_exang.conf_int())\n",
    "\n",
    "# Display results\n",
    "summary_df = pd.DataFrame({\n",
    "    'Odds Ratio': odds_ratios,\n",
    "    'CI Lower': conf_intervals[0],\n",
    "    'CI Upper': conf_intervals[1]\n",
    "})\n",
    "print(summary_df)\n",
    "        "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
